Anomalies are data points that differ from other observations in some way, typically as measured against a model fit to the data. In contrast with ordinary descriptive statistics, where such points are often excluded as outliers, here we are interested in finding where these anomalous data points occur.
We assume the anomaly detection task is unsupervised, i.e. we don’t have training data with points labeled as anomalous. Each data point passed to an anomaly detection model is given a score indicating how different the point is relative to the rest of the dataset. The calculation of this score varies between models, but a higher score always indicates a point is more anomalous. Often a threshold is chosen to make a final classification of each point as typical or anomalous; this post-processing step is left to the user.
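As a minimal illustration of that post-processing step, the sketch below (plain Python, with made-up scores and an arbitrary threshold chosen only for this example) converts raw anomaly scores into binary typical/anomalous labels:
# hypothetical anomaly scores and threshold, for illustration only
scores = [0.8, 1.1, 5.7, 0.9, 3.2]
threshold = 2.0
# a score at or above the threshold is flagged as anomalous
labels = ['anomalous' if s >= threshold else 'typical' for s in scores]
print labels  # ['typical', 'typical', 'anomalous', 'typical', 'anomalous']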
The GraphLab Create (GLC) Anomaly Detection toolkit currently includes three models for two different data contexts: the Local Outlier Factor model for multivariate data, and the Moving Z-score and Bayesian Changepoints models for time-series data.
In this short note, we demonstrate how the GLC Local Outlier Factor model can be used to reveal anomalies in a multivariate data set. We will use the customer data from a recent AirBnB New User Bookings competition on Kaggle. More specifically, we have downloaded a copy of the file train_users_2.csv into our working directory. Each row in this dataset describes one of 213,451 AirBnB users; there is a mix of basic features, such as gender, age, and preferred language, as well as the user's "technology profile", including the browser type, device type, and sign-up method.
In [1]:
import graphlab as gl
from visualization_helper_functions import *
In [2]:
customer_data = gl.SFrame.read_csv('./train_users_2.csv')
In [3]:
customer_data.head(5)
Out[3]:
For the needs of our current presentation we will only need a small subset of the available basic customer features, i.e. 'gender', 'age', and 'language'.
In [4]:
features = ['gender', 'age', 'language']
customer_data = customer_data[['id']+features]
customer_data
Out[4]:
From the quick exploratory data analysis below:
In [5]:
%matplotlib inline
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)
In [6]:
gl.canvas.set_target('browser')
customer_data[['age']].show()
In [7]:
print 'Number of customer records with ages of 2013 or larger: %d' %\
len(customer_data[customer_data['age'] >= 2013])
we notice that there are about 750 records having an 'age' value of '2013' or '2014', which is of course wrong. Most probably the year was recorded accidentally in this field. The remaining 'age' values seem absolutely reasonable, with only some rare customer entries having ages greater than '100'. In fact, more than 128 thousand customer entries are found to have ages in the [1, 142] interval. More specifically, we have chosen to treat any value falling in the [1, 150] interval as an eligible recording of a customer age, re-assigning all the remaining ones as missing:
In [8]:
customer_data['age'] = customer_data['age'].apply(lambda age: age if age < 150 else None)
customer_data = customer_data.dropna(columns = features, how='any')
print 'Number of Rows in dataset: %d' % len(customer_data)
Now, the univariate summary statistics of the customer_data set take the form:
In [9]:
univariate_summary_plot(customer_data, features, nsubplots_inrow=3, subplots_wspace=0.7)
and more specifically the remaining customer ages follow the distribution below:
In [10]:
# transform the SFrame into a Pandas DataFrame
customer_data_df = customer_data.to_dataframe()
customer_data_df['gender'] = customer_data_df['gender'].astype(str)
customer_data_df['age'] = customer_data_df['age'].astype(float)
customer_data_df['language'] = customer_data_df['language'].astype(str)
In [12]:
# import the plotting libraries (if not already provided by the helper module)
import matplotlib.pyplot as plt
import seaborn as sns

# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep", color_codes=True)
# initialize the matplotlib figure
plt.figure(figsize=(12,7))
# draw distplot
ax1 = sns.distplot(customer_data_df.age, bins=None, hist=True, kde=False, rug=False, color='b')
If we would like to explore the countplot for the variable language in more detail, we can temporarily exclude the English-speaking customers and redraw the graph:
In [13]:
# exclude the english-speaking customers
customer_data_df_nen = customer_data_df[customer_data_df['language']!='en']
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=False)
# initialize the matplotlib figure
plt.figure(figsize=(7,11))
plt.ylabel('language', {'fontweight': 'bold'})
plt.title('Countplot of Customer Languages\n[English-speaking people excluded]',
{'fontweight': 'bold'})
# draw countplot
ax2 = sns.countplot(y='language', data=customer_data_df_nen, palette='deep', color='b')
The univariate summary statistics plot for this new customer_data_df_nen set is as follows.
In [14]:
univariate_summary_plot(customer_data_df_nen, features, subplots_wspace=0.7)
The data set of interest, customer_data, has two nominal categorical variables:
- 'gender': a nominal categorical attribute with the levels FEMALE/MALE/unknown/OTHER
- 'language': a nominal categorical attribute with 25 different languages,
which we should encode before applying any learning algorithm. To do so, we will apply the OneHotEncoder transformation as shown below:
In [15]:
one_hot_encoder = gl.toolkits.feature_engineering.OneHotEncoder(features=['gender', 'language'])
customer_data1 = one_hot_encoder.fit_transform(customer_data)
Local Outlier Factor (LOF) models are distance-based learning algorithms. Therefore, we need to standardize the 'age' feature so that it is on roughly the same scale as the encoded categorical variables.
In [16]:
customer_data1['age'] = (customer_data['age'] - customer_data['age'].mean())/\
customer_data['age'].std()
customer_data1
Out[16]:
Next, we train the LOF model using this transformed customer_data1 set.
In [17]:
model_lof = gl.anomaly_detection.local_outlier_factor.create(customer_data1,
features = ['age', 'encoded_features'],
threshold_distances=True,
verbose=False)
In [18]:
model_lof.save('./model_lof')
model_lof = gl.load_model('./model_lof/')
In [19]:
print 'The LOF model has been trained with the following options:'
print '-------------------------------------------------------------'
print model_lof.get_current_options()
Note that the model can automatically choose a suitable distance metric for the data types of the available features. Here, a composite distance has been chosen: a 'jaccard' metric for the 'encoded_features' column and a 'euclidean' metric for the 'age' column, with both metrics weighted by 1.0.
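If we prefer to set this composite distance explicitly rather than rely on the automatic choice, a minimal sketch is shown below. It assumes the create function accepts a composite distance through its distance argument, specified as a list of [feature names, metric name, weight] triples (as in the GLC distances documentation); this is an illustration, not the call used above.
# hypothetical explicit composite distance: each entry is
# [list of feature names, distance metric name, relative weight]
composite_distance = [[['encoded_features'], 'jaccard', 1.0],
                      [['age'], 'euclidean', 1.0]]
model_lof_explicit = gl.anomaly_detection.local_outlier_factor.create(
    customer_data1,
    distance=composite_distance,
    threshold_distances=True,
    verbose=False)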
If we want to see what the model has built internally, we can simply write:
In [20]:
print model_lof
More importantly, here is the SFrame with the LOF anomaly scores:
In [21]:
model_lof['scores']
Out[21]:
Firstly, note that the model worked successfully, scoring each of the 124,681 input rows. Secondly, the anomaly score for many observations in our AirBnB dataset is nan, which indicates the point has many neighbors at exactly the same location, making the ratio of densities undefined. These points cannot be outliers.
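If we want to quantify how common this is, one quick, illustrative way to count the nan scores is to exploit the fact that nan is the only value not equal to itself:
# count the anomaly scores that are undefined (s != s is True only for nan)
lof_scores = model_lof['scores']['anomaly_score']
num_nan_scores = lof_scores.apply(lambda s: 1 if s != s else 0).sum()
print 'Number of nan anomaly scores: %d' % num_nan_scores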
However, for the problem at hand we are interested in whether any true outliers exist and under what circumstances they occur; this is where the real business value lies. There are two common ways to isolate them:
A. Rank the observations by their 'anomaly_score' and inspect the top-k most anomalous points:
In [22]:
top10_anomalies = model_lof['scores'].topk('anomaly_score', k=10)
top10_anomalies.print_rows(num_rows=10)
Note that the anomaly scores for these points are infinite, which happens when a point is next to several identical points, but is not itself a member of that bunch. These points are certainly anomalous, but our specific choice of k was arbitrary and excluded many points that are also likely anomalous.
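Before moving on to option B, one natural follow-up is to check how sensitive this ranking is to the neighborhood size. The sketch below assumes the create function exposes that choice through a num_neighbors argument (as in the GLC documentation); the value 10 is arbitrary and only for illustration.
# re-train the LOF model with a larger (hypothetical) neighborhood size
# and compare the new top-10 anomalies against the previous ranking
model_lof_k10 = gl.anomaly_detection.local_outlier_factor.create(
    customer_data1,
    features=['age', 'encoded_features'],
    num_neighbors=10,
    threshold_distances=True,
    verbose=False)
model_lof_k10['scores'].topk('anomaly_score', k=10).print_rows(num_rows=10)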
B. Choose a threshold, either from domain knowledge or scientific expertise, in order to find the anomalous observations in your data set: observations with an 'anomaly_score' greater than this threshold will be the anomalous ones.
Of course, a closer look at the distribution of the anomaly_scores may help us a lot with this decision.
In [23]:
anomaly_scores_sketch = model_lof['scores']['anomaly_score'].sketch_summary()
print anomaly_scores_sketch
In [24]:
threshold = anomaly_scores_sketch.quantile(0.9)
anomalies_mask = model_lof['scores']['anomaly_score'] >= threshold
anomalies = model_lof['scores'][anomalies_mask]
print 'Threshold: %.5f' % threshold, '\nNumber of Anomalies: %d' % len(anomalies)
In [25]:
anomalies.print_rows(num_rows=10)
Finally, we can filter the customer_data set by anomalies['row_id'] to obtain the original features of these anomalous records.
In [26]:
customer_data = customer_data.add_row_number(column_name='row_id')
anomalous_customer_data = customer_data.filter_by(anomalies['row_id'], 'row_id')
anomalous_customer_data.print_rows(num_rows=200)
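To get a feel for which profiles are being flagged, one further optional step is to summarize the anomalous records by their categorical attributes; the sketch below simply counts the flagged customers per gender/language combination:
# count the flagged customers for each gender/language combination
anomaly_profile = anomalous_customer_data.groupby(
    ['gender', 'language'],
    {'num_anomalous_customers': gl.aggregate.COUNT})
anomaly_profile.sort('num_anomalous_customers', ascending=False).print_rows(num_rows=20)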